Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
“ReneWind” is a company working on improving the machinery/processes involved in wind energy production using machine learning, and it has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies by company). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one for identifying failures in advance, so that generators can be repaired before failing/breaking and the overall maintenance cost reduced. The nature of predictions made by the classification model will translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable represents “failure” and “0” represents “no failure”.
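Since a missed failure (false negative) forces the most expensive outcome, a replacement, recall is the metric to prioritize. A minimal sketch of how the confusion-matrix cells translate into maintenance cost, using hypothetical unit costs (the problem statement gives only their ordering: inspection < repair < replacement):

```python
# Hypothetical unit costs; only their ordering is given in the problem statement.
COST_INSPECTION = 1    # false positive: a healthy generator is inspected
COST_REPAIR = 5        # true positive: failure predicted and repaired in time
COST_REPLACEMENT = 40  # false negative: failure missed, generator breaks

def maintenance_cost(fp, fn, tp):
    """Total cost implied by the confusion-matrix counts (true negatives cost nothing)."""
    return fp * COST_INSPECTION + tp * COST_REPAIR + fn * COST_REPLACEMENT

# A high-recall model (few false negatives) is cheaper overall,
# even at the price of more false positives:
high_recall_cost = maintenance_cost(fp=120, fn=30, tp=250)
low_recall_cost = maintenance_cost(fp=20, fn=180, tp=100)
```

Under any costs with this ordering, the model that misses fewer failures wins, which is why the models below will be compared primarily on recall.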
# Installing the libraries with the specified version.
#!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 imbalanced-learn==0.10.1 xgboost==2.0.3 threadpoolctl==3.3.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
# To suppress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Purpose: Calculate various discrete statistical values for a specific column in a DataFrame
#
# Prerequisites:
# Requires the developer to only send data that discrete statistics can safely be calculated for.
# This function would require more extensive data validation checks and more robust exception handling.
#
# Inputs
# data : DataFrame object containing rows and columns of data
# feature: str representing the column name to run statistics on
#
def calculate_statistics(data, feature):
# Only calculate and print statistics if data is a DataFrame and feature is a single column name (str)
if isinstance(data, pd.DataFrame) and isinstance(feature, str):
# For future, would like to use Describe to pull data types for each column
# Then only perform the calculations and prints if of type Int64 or Float64
# Calculate and print various discrete statistical values
print(f"Discrete Statistics for {feature}\n")
print(f"Mean : {data[feature].mean():.6f}")
print(f"Mode : {data[feature].mode()[0]}")
print(f"Median : {data[feature].median()}")
print(f"Min : {data[feature].min()}")
print(f"Max : {data[feature].max()}")
print(f"Standard Deviation: {data[feature].std():.6f}")
print(f"Percentiles : \n{data[feature].quantile([.25,.50,.75])}")
# Provided by GreatLearning
# function to create histogram and boxplot; both are aligned by mean
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
) # For histogram; "auto" is seaborn's default binning when bins is None
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Provided by GreatLearning
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # x-coordinate of the bar's center
y = p.get_height() # height of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Provided by GreatLearning
# function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
)
plt.tight_layout()
plt.show()
# Provided by GreatLearning
# Display a stacked barplot
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
# a second legend call would override the first, so place the legend outside the axes once
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
#Outlier detection
def outlier_detection(data):
"""
Display a grid of box plots for each numeric feature; while showing the outlier data
data: dataframe
"""
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping the target column ("Target") so only the predictors are plotted
if "Target" in numeric_columns:
numeric_columns.remove("Target")
# size the grid so all predictors fit, four box plots per row
rows = int(np.ceil(len(numeric_columns) / 4))
plt.figure(figsize=(15, 3 * rows))
for i, variable in enumerate(numeric_columns):
plt.subplot(rows, 4, i + 1)
plt.boxplot(data[variable].dropna(), whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Purpose: To treat outliers by clipping them to the lower and upper whisker
#
# Inputs:
# df: Dataframe
# col: Feature that has outliers to treat
#
# Note: This procedure is being utilized from GreatLearning; Week 4 (Hands_on_Notebook_ExploratoryDataAnalysis)
def treat_outliers(df, col):
"""
treats outliers in a variable
col: str, name of the numerical variable
df: dataframe
col: name of the column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1 # Inter Quantile Range (75th perentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
# all the values smaller than lower_whisker will be assigned the value of lower_whisker
# all the values greater than upper_whisker will be assigned the value of upper_whisker
# the assignment will be done by using the clip function of NumPy
df[col] = np.clip(df[col], lower_whisker, upper_whisker)
return df
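As a quick illustration of the clipping logic above, applied to a hypothetical toy column (values chosen so one point falls outside each whisker):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one extreme value on each side.
demo = pd.DataFrame({"V1": [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0, -50.0]})

# Same IQR-based clipping as treat_outliers above.
Q1, Q3 = demo["V1"].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_whisker, upper_whisker = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
demo["V1"] = np.clip(demo["V1"], lower_whisker, upper_whisker)
```

After clipping, 100.0 and -50.0 collapse onto the upper and lower whiskers; the interior values are untouched.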
# Provided by GreatLearning
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# converting the resulting booleans to 0/1 class labels
pred = pred_temp.astype(int)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
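The `threshold` parameter above is worth tuning for this problem: lowering it trades precision for recall, and recall is what keeps expensive missed failures down. A sketch on hypothetical labels and probabilities (not from the ReneWind data):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted failure probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba = np.array([0.10, 0.32, 0.30, 0.45, 0.55, 0.35, 0.80, 0.60, 0.40, 0.15])

# Lowering the threshold flags more generators as "failure":
# recall rises (fewer missed failures) while precision falls.
for threshold in (0.5, 0.3):
    pred = (proba > threshold).astype(int)
    print(f"threshold={threshold}: recall={recall_score(y_true, pred):.2f}, "
          f"precision={precision_score(y_true, pred):.2f}")
```

The extra false positives at the lower threshold mean more inspections, but inspections are the cheapest outcome in this problem.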
# Provided by GreatLearning
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
df = pd.read_csv("./train.csv")
df_test = pd.read_csv("./test.csv")
# Verify the data file was read correctly by displaying the first five rows.
df.head(5)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.465 | -4.679 | 3.102 | 0.506 | -0.221 | -2.033 | -2.911 | 0.051 | -1.522 | 3.762 | -5.715 | 0.736 | 0.981 | 1.418 | -3.376 | -3.047 | 0.306 | 2.914 | 2.270 | 4.395 | -2.388 | 0.646 | -1.191 | 3.133 | 0.665 | -2.511 | -0.037 | 0.726 | -3.982 | -1.073 | 1.667 | 3.060 | -1.690 | 2.846 | 2.235 | 6.667 | 0.444 | -2.369 | 2.951 | -3.480 | 0 |
| 1 | 3.366 | 3.653 | 0.910 | -1.368 | 0.332 | 2.359 | 0.733 | -4.332 | 0.566 | -0.101 | 1.914 | -0.951 | -1.255 | -2.707 | 0.193 | -4.769 | -2.205 | 0.908 | 0.757 | -5.834 | -3.065 | 1.597 | -1.757 | 1.766 | -0.267 | 3.625 | 1.500 | -0.586 | 0.783 | -0.201 | 0.025 | -1.795 | 3.033 | -2.468 | 1.895 | -2.298 | -1.731 | 5.909 | -0.386 | 0.616 | 0 |
| 2 | -3.832 | -5.824 | 0.634 | -2.419 | -1.774 | 1.017 | -2.099 | -3.173 | -2.082 | 5.393 | -0.771 | 1.107 | 1.144 | 0.943 | -3.164 | -4.248 | -4.039 | 3.689 | 3.311 | 1.059 | -2.143 | 1.650 | -1.661 | 1.680 | -0.451 | -4.551 | 3.739 | 1.134 | -2.034 | 0.841 | -1.600 | -0.257 | 0.804 | 4.086 | 2.292 | 5.361 | 0.352 | 2.940 | 3.839 | -4.309 | 0 |
| 3 | 1.618 | 1.888 | 7.046 | -1.147 | 0.083 | -1.530 | 0.207 | -2.494 | 0.345 | 2.119 | -3.053 | 0.460 | 2.705 | -0.636 | -0.454 | -3.174 | -3.404 | -1.282 | 1.582 | -1.952 | -3.517 | -1.206 | -5.628 | -1.818 | 2.124 | 5.295 | 4.748 | -2.309 | -3.963 | -6.029 | 4.949 | -3.584 | -2.577 | 1.364 | 0.623 | 5.550 | -1.527 | 0.139 | 3.101 | -1.277 | 0 |
| 4 | -0.111 | 3.872 | -3.758 | -2.983 | 3.793 | 0.545 | 0.205 | 4.849 | -1.855 | -6.220 | 1.998 | 4.724 | 0.709 | -1.989 | -2.633 | 4.184 | 2.245 | 3.734 | -6.313 | -5.380 | -0.887 | 2.062 | 9.446 | 4.490 | -3.945 | 4.582 | -8.780 | -3.383 | 5.107 | 6.788 | 2.044 | 8.266 | 6.629 | -10.069 | 1.223 | -3.230 | 1.687 | -2.164 | -3.645 | 6.510 | 0 |
# Verify the entire data file was read correctly by displaying the last five rows.
df.tail(5)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19995 | -2.071 | -1.088 | -0.796 | -3.012 | -2.288 | 2.807 | 0.481 | 0.105 | -0.587 | -2.899 | 8.868 | 1.717 | 1.358 | -1.777 | 0.710 | 4.945 | -3.100 | -1.199 | -1.085 | -0.365 | 3.131 | -3.948 | -3.578 | -8.139 | -1.937 | -1.328 | -0.403 | -1.735 | 9.996 | 6.955 | -3.938 | -8.274 | 5.745 | 0.589 | -0.650 | -3.043 | 2.216 | 0.609 | 0.178 | 2.928 | 1 |
| 19996 | 2.890 | 2.483 | 5.644 | 0.937 | -1.381 | 0.412 | -1.593 | -5.762 | 2.150 | 0.272 | -2.095 | -1.526 | 0.072 | -3.540 | -2.762 | -10.632 | -0.495 | 1.720 | 3.872 | -1.210 | -8.222 | 2.121 | -5.492 | 1.452 | 1.450 | 3.685 | 1.077 | -0.384 | -0.839 | -0.748 | -1.089 | -4.159 | 1.181 | -0.742 | 5.369 | -0.693 | -1.669 | 3.660 | 0.820 | -1.987 | 0 |
| 19997 | -3.897 | -3.942 | -0.351 | -2.417 | 1.108 | -1.528 | -3.520 | 2.055 | -0.234 | -0.358 | -3.782 | 2.180 | 6.112 | 1.985 | -8.330 | -1.639 | -0.915 | 5.672 | -3.924 | 2.133 | -4.502 | 2.777 | 5.728 | 1.620 | -1.700 | -0.042 | -2.923 | -2.760 | -2.254 | 2.552 | 0.982 | 7.112 | 1.476 | -3.954 | 1.856 | 5.029 | 2.083 | -6.409 | 1.477 | -0.874 | 0 |
| 19998 | -3.187 | -10.052 | 5.696 | -4.370 | -5.355 | -1.873 | -3.947 | 0.679 | -2.389 | 5.457 | 1.583 | 3.571 | 9.227 | 2.554 | -7.039 | -0.994 | -9.665 | 1.155 | 3.877 | 3.524 | -7.015 | -0.132 | -3.446 | -4.801 | -0.876 | -3.812 | 5.422 | -3.732 | 0.609 | 5.256 | 1.915 | 0.403 | 3.164 | 3.752 | 8.530 | 8.451 | 0.204 | -7.130 | 4.249 | -6.112 | 0 |
| 19999 | -2.687 | 1.961 | 6.137 | 2.600 | 2.657 | -4.291 | -2.344 | 0.974 | -1.027 | 0.497 | -9.589 | 3.177 | 1.055 | -1.416 | -4.669 | -5.405 | 3.720 | 2.893 | 2.329 | 1.458 | -6.429 | 1.818 | 0.806 | 7.786 | 0.331 | 5.257 | -4.867 | -0.819 | -5.667 | -2.861 | 4.674 | 6.621 | -1.989 | -1.349 | 3.952 | 5.450 | -0.455 | -2.202 | 1.678 | -1.974 | 0 |
# Verify the data file was read correctly by displaying the first five rows.
df_test.head(5)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613 | -3.820 | 2.202 | 1.300 | -1.185 | -4.496 | -1.836 | 4.723 | 1.206 | -0.342 | -5.123 | 1.017 | 4.819 | 3.269 | -2.984 | 1.387 | 2.032 | -0.512 | -1.023 | 7.339 | -2.242 | 0.155 | 2.054 | -2.772 | 1.851 | -1.789 | -0.277 | -1.255 | -3.833 | -1.505 | 1.587 | 2.291 | -5.411 | 0.870 | 0.574 | 4.157 | 1.428 | -10.511 | 0.455 | -1.448 | 0 |
| 1 | 0.390 | -0.512 | 0.527 | -2.577 | -1.017 | 2.235 | -0.441 | -4.406 | -0.333 | 1.967 | 1.797 | 0.410 | 0.638 | -1.390 | -1.883 | -5.018 | -3.827 | 2.418 | 1.762 | -3.242 | -3.193 | 1.857 | -1.708 | 0.633 | -0.588 | 0.084 | 3.014 | -0.182 | 0.224 | 0.865 | -1.782 | -2.475 | 2.494 | 0.315 | 2.059 | 0.684 | -0.485 | 5.128 | 1.721 | -1.488 | 0 |
| 2 | -0.875 | -0.641 | 4.084 | -1.590 | 0.526 | -1.958 | -0.695 | 1.347 | -1.732 | 0.466 | -4.928 | 3.565 | -0.449 | -0.656 | -0.167 | -1.630 | 2.292 | 2.396 | 0.601 | 1.794 | -2.120 | 0.482 | -0.841 | 1.790 | 1.874 | 0.364 | -0.169 | -0.484 | -2.119 | -2.157 | 2.907 | -1.319 | -2.997 | 0.460 | 0.620 | 5.632 | 1.324 | -1.752 | 1.808 | 1.676 | 0 |
| 3 | 0.238 | 1.459 | 4.015 | 2.534 | 1.197 | -3.117 | -0.924 | 0.269 | 1.322 | 0.702 | -5.578 | -0.851 | 2.591 | 0.767 | -2.391 | -2.342 | 0.572 | -0.934 | 0.509 | 1.211 | -3.260 | 0.105 | -0.659 | 1.498 | 1.100 | 4.143 | -0.248 | -1.137 | -5.356 | -4.546 | 3.809 | 3.518 | -3.074 | -0.284 | 0.955 | 3.029 | -1.367 | -3.412 | 0.906 | -2.451 | 0 |
| 4 | 5.828 | 2.768 | -1.235 | 2.809 | -1.642 | -1.407 | 0.569 | 0.965 | 1.918 | -2.775 | -0.530 | 1.375 | -0.651 | -1.679 | -0.379 | -4.443 | 3.894 | -0.608 | 2.945 | 0.367 | -5.789 | 4.598 | 4.450 | 3.225 | 0.397 | 0.248 | -2.362 | 1.079 | -0.473 | 2.243 | -3.591 | 1.774 | -1.502 | -2.227 | 4.777 | -6.560 | -0.806 | -0.276 | -3.858 | -0.538 | 0 |
# Verify the entire data file was read correctly by displaying the last five rows.
df_test.tail(5)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | -5.120 | 1.635 | 1.251 | 4.036 | 3.291 | -2.932 | -1.329 | 1.754 | -2.985 | 1.249 | -6.878 | 3.715 | -2.512 | -1.395 | -2.554 | -2.197 | 4.772 | 2.403 | 3.792 | 0.487 | -2.028 | 1.778 | 3.668 | 11.375 | -1.977 | 2.252 | -7.319 | 1.907 | -3.734 | -0.012 | 2.120 | 9.979 | 0.063 | 0.217 | 3.036 | 2.109 | -0.557 | 1.939 | 0.513 | -2.694 | 0 |
| 4996 | -5.172 | 1.172 | 1.579 | 1.220 | 2.530 | -0.669 | -2.618 | -2.001 | 0.634 | -0.579 | -3.671 | 0.460 | 3.321 | -1.075 | -7.113 | -4.356 | -0.001 | 3.698 | -0.846 | -0.222 | -3.645 | 0.736 | 0.926 | 3.278 | -2.277 | 4.458 | -4.543 | -1.348 | -1.779 | 0.352 | -0.214 | 4.424 | 2.604 | -2.152 | 0.917 | 2.157 | 0.467 | 0.470 | 2.197 | -2.377 | 0 |
| 4997 | -1.114 | -0.404 | -1.765 | -5.879 | 3.572 | 3.711 | -2.483 | -0.308 | -0.922 | -2.999 | -0.112 | -1.977 | -1.623 | -0.945 | -2.735 | -0.813 | 0.610 | 8.149 | -9.199 | -3.872 | -0.296 | 1.468 | 2.884 | 2.792 | -1.136 | 1.198 | -4.342 | -2.869 | 4.124 | 4.197 | 3.471 | 3.792 | 7.482 | -10.061 | -0.387 | 1.849 | 1.818 | -1.246 | -1.261 | 7.475 | 0 |
| 4998 | -1.703 | 0.615 | 6.221 | -0.104 | 0.956 | -3.279 | -1.634 | -0.104 | 1.388 | -1.066 | -7.970 | 2.262 | 3.134 | -0.486 | -3.498 | -4.562 | 3.136 | 2.536 | -0.792 | 4.398 | -4.073 | -0.038 | -2.371 | -1.542 | 2.908 | 3.215 | -0.169 | -1.541 | -4.724 | -5.525 | 1.668 | -4.100 | -5.949 | 0.550 | -1.574 | 6.824 | 2.139 | -4.036 | 3.436 | 0.579 | 0 |
| 4999 | -0.604 | 0.960 | -0.721 | 8.230 | -1.816 | -2.276 | -2.575 | -1.041 | 4.130 | -2.731 | -3.292 | -1.674 | 0.465 | -1.646 | -5.263 | -7.988 | 6.480 | 0.226 | 4.963 | 6.752 | -6.306 | 3.271 | 1.897 | 3.271 | -0.637 | -0.925 | -6.759 | 2.990 | -0.814 | 3.499 | -8.435 | 2.370 | -1.062 | 0.791 | 4.952 | -7.441 | -0.070 | -0.918 | -2.291 | -5.363 | 0 |
# Make copies of the datasets so the originals stay intact
data = df.copy()
data_test = df_test.copy()
#Check the size of the training data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns in the training data frame.")
There are 20000 rows and 41 columns in the training data frame.
#Check the size of the test data
print(f"There are {data_test.shape[0]} rows and {data_test.shape[1]} columns in the test data frame.")
There are 5000 rows and 41 columns in the test data frame.
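Before modelling, the training frame would typically be split further into train and validation sets. A sketch on a stand-in frame (the real call would use `data`), using `stratify` so the rare failure class keeps the same proportion in both splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame shaped like the real problem: numeric predictors plus a
# rare binary Target (the real training data has ~5.6% failures).
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "V3"])
demo["Target"] = (np.arange(1000) < 56).astype(int)

X = demo.drop(columns="Target")
y = demo["Target"]
# stratify=y keeps the failure rate roughly equal across the splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
```

Without `stratify`, a random split of so rare a class could leave the validation set with almost no failures, making recall estimates unstable.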
# let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64
dtypes: float64(40), int64(1)
memory usage: 6.3 MB
#Show the statistical summary of the data
data.describe(include='all').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| V1 | 19982.000 | -0.272 | 3.442 | -11.876 | -2.737 | -0.748 | 1.840 | 15.493 |
| V2 | 19982.000 | 0.440 | 3.151 | -12.320 | -1.641 | 0.472 | 2.544 | 13.089 |
| V3 | 20000.000 | 2.485 | 3.389 | -10.708 | 0.207 | 2.256 | 4.566 | 17.091 |
| V4 | 20000.000 | -0.083 | 3.432 | -15.082 | -2.348 | -0.135 | 2.131 | 13.236 |
| V5 | 20000.000 | -0.054 | 2.105 | -8.603 | -1.536 | -0.102 | 1.340 | 8.134 |
| V6 | 20000.000 | -0.995 | 2.041 | -10.227 | -2.347 | -1.001 | 0.380 | 6.976 |
| V7 | 20000.000 | -0.879 | 1.762 | -7.950 | -2.031 | -0.917 | 0.224 | 8.006 |
| V8 | 20000.000 | -0.548 | 3.296 | -15.658 | -2.643 | -0.389 | 1.723 | 11.679 |
| V9 | 20000.000 | -0.017 | 2.161 | -8.596 | -1.495 | -0.068 | 1.409 | 8.138 |
| V10 | 20000.000 | -0.013 | 2.193 | -9.854 | -1.411 | 0.101 | 1.477 | 8.108 |
| V11 | 20000.000 | -1.895 | 3.124 | -14.832 | -3.922 | -1.921 | 0.119 | 11.826 |
| V12 | 20000.000 | 1.605 | 2.930 | -12.948 | -0.397 | 1.508 | 3.571 | 15.081 |
| V13 | 20000.000 | 1.580 | 2.875 | -13.228 | -0.224 | 1.637 | 3.460 | 15.420 |
| V14 | 20000.000 | -0.951 | 1.790 | -7.739 | -2.171 | -0.957 | 0.271 | 5.671 |
| V15 | 20000.000 | -2.415 | 3.355 | -16.417 | -4.415 | -2.383 | -0.359 | 12.246 |
| V16 | 20000.000 | -2.925 | 4.222 | -20.374 | -5.634 | -2.683 | -0.095 | 13.583 |
| V17 | 20000.000 | -0.134 | 3.345 | -14.091 | -2.216 | -0.015 | 2.069 | 16.756 |
| V18 | 20000.000 | 1.189 | 2.592 | -11.644 | -0.404 | 0.883 | 2.572 | 13.180 |
| V19 | 20000.000 | 1.182 | 3.397 | -13.492 | -1.050 | 1.279 | 3.493 | 13.238 |
| V20 | 20000.000 | 0.024 | 3.669 | -13.923 | -2.433 | 0.033 | 2.512 | 16.052 |
| V21 | 20000.000 | -3.611 | 3.568 | -17.956 | -5.930 | -3.533 | -1.266 | 13.840 |
| V22 | 20000.000 | 0.952 | 1.652 | -10.122 | -0.118 | 0.975 | 2.026 | 7.410 |
| V23 | 20000.000 | -0.366 | 4.032 | -14.866 | -3.099 | -0.262 | 2.452 | 14.459 |
| V24 | 20000.000 | 1.134 | 3.912 | -16.387 | -1.468 | 0.969 | 3.546 | 17.163 |
| V25 | 20000.000 | -0.002 | 2.017 | -8.228 | -1.365 | 0.025 | 1.397 | 8.223 |
| V26 | 20000.000 | 1.874 | 3.435 | -11.834 | -0.338 | 1.951 | 4.130 | 16.836 |
| V27 | 20000.000 | -0.612 | 4.369 | -14.905 | -3.652 | -0.885 | 2.189 | 17.560 |
| V28 | 20000.000 | -0.883 | 1.918 | -9.269 | -2.171 | -0.891 | 0.376 | 6.528 |
| V29 | 20000.000 | -0.986 | 2.684 | -12.579 | -2.787 | -1.176 | 0.630 | 10.722 |
| V30 | 20000.000 | -0.016 | 3.005 | -14.796 | -1.867 | 0.184 | 2.036 | 12.506 |
| V31 | 20000.000 | 0.487 | 3.461 | -13.723 | -1.818 | 0.490 | 2.731 | 17.255 |
| V32 | 20000.000 | 0.304 | 5.500 | -19.877 | -3.420 | 0.052 | 3.762 | 23.633 |
| V33 | 20000.000 | 0.050 | 3.575 | -16.898 | -2.243 | -0.066 | 2.255 | 16.692 |
| V34 | 20000.000 | -0.463 | 3.184 | -17.985 | -2.137 | -0.255 | 1.437 | 14.358 |
| V35 | 20000.000 | 2.230 | 2.937 | -15.350 | 0.336 | 2.099 | 4.064 | 15.291 |
| V36 | 20000.000 | 1.515 | 3.801 | -14.833 | -0.944 | 1.567 | 3.984 | 19.330 |
| V37 | 20000.000 | 0.011 | 1.788 | -5.478 | -1.256 | -0.128 | 1.176 | 7.467 |
| V38 | 20000.000 | -0.344 | 3.948 | -17.375 | -2.988 | -0.317 | 2.279 | 15.290 |
| V39 | 20000.000 | 0.891 | 1.753 | -6.439 | -0.272 | 0.919 | 2.058 | 7.760 |
| V40 | 20000.000 | -0.876 | 3.012 | -11.024 | -2.940 | -0.921 | 1.120 | 10.654 |
| Target | 20000.000 | 0.056 | 0.229 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
data.nunique()
V1        19982
V2        19982
V3        20000
V4        20000
V5        20000
V6        20000
V7        20000
V8        20000
V9        20000
V10       20000
V11       20000
V12       20000
V13       20000
V14       20000
V15       20000
V16       20000
V17       20000
V18       20000
V19       20000
V20       20000
V21       20000
V22       20000
V23       20000
V24       20000
V25       20000
V26       20000
V27       20000
V28       20000
V29       20000
V30       20000
V31       20000
V32       20000
V33       20000
V34       20000
V35       20000
V36       20000
V37       20000
V38       20000
V39       20000
V40       20000
Target        2
dtype: int64
# Check for missing values in the training data.
missing_counts = data.isnull().sum()
missing_counts[missing_counts > 0]
V1    18
V2    18
dtype: int64
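Only V1 and V2 have missing values (18 each, under 0.1% of rows). A median-imputation sketch using the `SimpleImputer` imported earlier, on a hypothetical toy frame (in the real workflow the imputer would be fit on the training data and then applied to the test data to avoid leakage):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame mimicking the situation above: a few NaNs in V1 and V2.
demo = pd.DataFrame({"V1": [1.0, np.nan, 3.0, 5.0], "V2": [2.0, 4.0, np.nan, 6.0]})

# Each NaN is replaced with the median of its own column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
```

The median is a reasonable default here because, unlike the mean, it is robust to the outliers visible in the sensor columns.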
#Let's investigate the rows that have a missing V1 value
data[data['V1'].isnull()]
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | NaN | -3.961 | 2.788 | -4.713 | -3.007 | -1.541 | -0.881 | 1.477 | 0.575 | -1.101 | -1.847 | 4.541 | 4.490 | 0.710 | -2.138 | -2.026 | 0.136 | 2.792 | -1.167 | 4.870 | -3.924 | 1.493 | -0.173 | -6.471 | 3.008 | -3.134 | 3.956 | -1.898 | -0.642 | -0.538 | -1.876 | -8.326 | -5.141 | 1.121 | -0.306 | 5.315 | 3.750 | -5.631 | 2.372 | 2.196 | 0 |
| 5941 | NaN | 1.008 | 1.228 | 5.397 | 0.064 | -2.707 | -2.028 | 0.534 | 3.007 | -2.362 | -5.713 | -1.620 | -0.046 | -0.511 | -3.030 | -4.996 | 6.425 | 0.773 | 1.235 | 5.860 | -3.851 | 1.707 | 1.016 | 2.310 | 1.162 | 0.388 | -4.908 | 1.453 | -2.539 | -0.518 | -2.749 | 1.870 | -3.115 | -0.550 | 1.714 | -2.257 | 0.411 | -3.434 | -1.299 | -1.769 | 0 |
| 6317 | NaN | -5.205 | 1.998 | -3.708 | -1.042 | -1.593 | -2.653 | 0.852 | -1.310 | 2.407 | -2.696 | 3.517 | 6.080 | 1.893 | -6.296 | -2.354 | -3.713 | 4.059 | -0.373 | 1.624 | -5.273 | 2.433 | 2.354 | 0.062 | -0.469 | -1.308 | 1.865 | -2.446 | -2.908 | 1.166 | 1.492 | 3.074 | -0.068 | -0.278 | 3.197 | 7.016 | 1.302 | -4.580 | 2.956 | -2.363 | 0 |
| 6464 | NaN | 2.146 | 5.004 | 4.192 | 1.428 | -6.438 | -0.931 | 3.794 | -0.683 | -0.739 | -8.189 | 6.676 | 4.109 | -0.653 | -4.763 | -1.715 | 4.042 | -0.464 | 4.026 | 3.830 | -5.310 | 0.926 | 2.933 | 4.457 | -0.354 | 4.864 | -5.043 | -0.770 | -5.669 | -2.644 | 1.855 | 5.231 | -5.113 | 1.746 | 2.587 | 3.991 | 0.611 | -4.273 | 1.865 | -3.599 | 0 |
| 7073 | NaN | 2.534 | 2.763 | -1.674 | -1.942 | -0.030 | 0.911 | -3.200 | 2.949 | -0.413 | 0.013 | -0.483 | 2.908 | -0.942 | -0.655 | -6.153 | -2.604 | -0.674 | 0.767 | -2.704 | -6.404 | 2.858 | -1.414 | -2.859 | 2.362 | 3.168 | 5.590 | -1.769 | -2.734 | -3.304 | -0.201 | -4.887 | -2.612 | -1.501 | 2.036 | -0.829 | -1.370 | 0.572 | -0.132 | -0.322 | 0 |
| 8431 | NaN | -1.399 | -2.008 | -1.750 | 0.932 | -1.290 | -0.270 | 4.459 | -2.776 | -1.212 | -2.049 | 5.283 | -0.872 | 0.068 | -0.667 | 1.865 | 3.443 | 3.297 | -0.930 | 0.944 | -0.558 | 2.547 | 6.471 | 4.467 | -0.811 | -2.225 | -3.844 | 0.170 | 0.232 | 2.963 | 0.415 | 4.560 | -0.421 | -2.037 | 1.110 | 1.521 | 2.114 | -2.253 | -0.939 | 2.542 | 0 |
| 8439 | NaN | -3.841 | 0.197 | 4.148 | 1.151 | -0.993 | -4.732 | 0.559 | -0.927 | 0.458 | -4.889 | -1.247 | -1.653 | -0.235 | -5.407 | -2.989 | 4.834 | 4.638 | 1.297 | 6.399 | -1.092 | 0.134 | 0.410 | 6.207 | -1.939 | -2.996 | -8.530 | 2.124 | 0.821 | 4.871 | -2.013 | 6.819 | 3.451 | 0.242 | 3.216 | 1.203 | 1.275 | -1.921 | 0.579 | -2.838 | 0 |
| 11156 | NaN | -0.667 | 3.716 | 4.934 | 1.668 | -4.356 | -2.823 | 0.373 | -0.710 | 2.177 | -8.808 | 2.562 | 1.959 | 0.005 | -5.940 | -4.676 | 3.292 | 1.975 | 4.434 | 4.713 | -4.124 | 1.048 | 0.859 | 6.753 | -0.812 | 1.876 | -4.789 | 1.248 | -6.278 | -2.253 | 0.464 | 6.663 | -2.898 | 3.068 | 2.487 | 4.809 | 0.069 | -1.216 | 3.014 | -5.973 | 0 |
| 11287 | NaN | -2.562 | -0.181 | -7.195 | -1.044 | 1.385 | 1.306 | 1.559 | -2.992 | 1.275 | 3.033 | 3.689 | 0.522 | 0.753 | 2.457 | 3.192 | -4.054 | 1.523 | -2.112 | -3.494 | 0.554 | 0.755 | 1.150 | -2.128 | 0.731 | -2.165 | 5.066 | -2.036 | 1.563 | 0.856 | 3.188 | -2.532 | 0.560 | -1.154 | -0.019 | 4.065 | 0.979 | -0.571 | 0.630 | 3.919 | 0 |
| 11456 | NaN | 1.300 | 4.383 | 1.583 | -0.077 | 0.659 | -1.639 | -4.815 | -0.915 | 2.812 | 0.572 | -0.319 | 0.853 | -2.777 | -3.633 | -5.402 | -4.239 | 0.261 | 5.218 | -3.446 | -4.544 | -0.524 | -5.112 | 3.633 | -2.315 | 4.270 | -0.810 | -0.532 | 0.693 | 1.787 | 0.724 | 1.772 | 5.755 | 1.204 | 5.664 | 0.414 | -2.644 | 5.530 | 2.105 | -4.945 | 0 |
| 12221 | NaN | -2.326 | -0.052 | 0.615 | -0.896 | -2.437 | 0.350 | 2.093 | -2.934 | 2.291 | -3.838 | 6.294 | -1.584 | 0.012 | 0.547 | -0.998 | 3.333 | 1.319 | 5.203 | 3.560 | -0.647 | 2.200 | 2.725 | 4.346 | 0.560 | -4.238 | -0.249 | 2.953 | -3.262 | -0.752 | -2.262 | 0.135 | -5.183 | 5.252 | 0.716 | 3.211 | 1.642 | 1.544 | 1.805 | -2.040 | 0 |
| 12447 | NaN | 0.753 | -0.271 | 1.301 | 2.039 | -1.485 | -0.412 | 0.981 | 0.810 | -0.065 | -3.844 | -1.009 | 1.098 | 1.431 | -1.497 | 0.018 | 1.403 | 0.469 | -2.055 | 0.628 | 0.045 | 0.566 | 2.473 | 1.881 | 0.200 | 1.757 | -1.190 | -0.288 | -3.974 | -3.101 | 2.092 | 4.410 | -2.209 | -1.359 | -1.726 | 1.679 | -0.209 | -2.336 | 0.112 | -0.543 | 0 |
| 13086 | NaN | 2.056 | 3.331 | 2.741 | 2.783 | -0.444 | -2.015 | -0.887 | -1.111 | 0.025 | -2.753 | -1.148 | -1.543 | -2.020 | -2.344 | -1.388 | 1.272 | 1.224 | 0.750 | -0.925 | -0.823 | -1.865 | -2.626 | 5.158 | -1.809 | 4.433 | -5.879 | -0.431 | 0.966 | 1.189 | 3.295 | 5.112 | 4.675 | -1.710 | 2.430 | 0.997 | -1.191 | 1.207 | 0.511 | -0.884 | 0 |
| 13411 | NaN | 2.705 | 4.587 | 1.868 | 2.050 | -0.925 | -1.669 | -1.654 | -0.243 | -0.317 | -2.224 | 0.258 | 1.562 | -2.228 | -3.846 | -2.398 | -0.656 | 0.637 | 1.076 | -1.443 | -2.758 | -1.739 | -3.150 | 2.459 | -1.692 | 6.165 | -3.977 | -1.734 | 0.289 | 0.199 | 2.580 | 2.527 | 3.625 | -1.200 | 2.328 | 1.667 | -0.943 | 0.947 | 1.655 | -1.665 | 0 |
| 14202 | NaN | 7.039 | 2.145 | -3.202 | 4.113 | 3.376 | -1.337 | -4.546 | 1.941 | -5.467 | 2.364 | -1.338 | 3.052 | -4.598 | -6.043 | -4.133 | -2.799 | 4.435 | -6.633 | -8.543 | -4.267 | -0.383 | -1.141 | -0.153 | -3.116 | 11.244 | -5.046 | -5.440 | 5.035 | 2.808 | 1.920 | 0.158 | 9.768 | -10.258 | 0.514 | -1.975 | -0.029 | 3.127 | 0.009 | 4.538 | 0 |
| 15520 | NaN | 1.383 | 3.237 | -3.818 | -1.917 | 0.438 | 1.348 | -2.036 | 1.156 | 0.307 | 2.234 | 0.628 | 3.356 | -0.483 | 0.548 | -2.162 | -5.072 | -1.413 | -0.092 | -3.925 | -4.032 | 0.784 | -2.563 | -4.674 | 1.767 | 2.998 | 6.633 | -2.927 | -0.687 | -2.376 | 2.066 | -5.415 | -0.897 | -1.058 | 1.417 | 1.162 | -1.147 | -0.048 | 0.605 | 0.815 | 0 |
| 16576 | NaN | 3.934 | -0.762 | 2.652 | 1.754 | -0.554 | 1.829 | -0.105 | -3.737 | 1.037 | -0.359 | 5.859 | -4.206 | -3.349 | 1.476 | -0.451 | 2.342 | -0.376 | 6.431 | -3.529 | 0.458 | 0.970 | 2.185 | 8.724 | -2.764 | 1.919 | -4.303 | 2.849 | -0.029 | 1.116 | -1.477 | 3.486 | 1.028 | 2.846 | 1.744 | -2.000 | -0.783 | 8.698 | 0.352 | -2.005 | 0 |
| 18104 | NaN | 1.492 | 2.659 | 0.223 | -0.304 | -1.347 | 0.044 | -0.159 | 1.108 | -0.573 | -2.281 | 0.316 | 1.005 | -0.495 | -0.360 | -2.629 | 0.661 | -0.311 | 0.490 | 0.092 | -3.322 | 1.033 | -0.598 | -0.154 | 1.547 | 2.155 | 0.984 | -0.863 | -2.067 | -2.184 | 1.339 | -1.007 | -2.230 | -0.871 | 1.300 | 0.668 | -0.503 | -1.485 | -0.154 | 0.157 | 0 |
# Let's investigate the rows that have a missing V2 value
data[data['V2'].isnull()]
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 613 | -2.049 | NaN | -1.624 | -3.324 | 0.152 | 0.600 | -1.813 | 0.852 | -1.523 | 0.211 | -0.460 | 2.380 | 1.676 | 0.529 | -3.768 | -1.096 | -0.785 | 4.855 | -1.961 | 0.047 | -2.195 | 2.567 | 3.988 | 2.068 | -1.312 | -2.227 | -1.315 | -0.934 | 0.535 | 3.590 | -0.471 | 3.264 | 2.379 | -2.457 | 1.719 | 2.537 | 1.702 | -1.435 | 0.597 | 0.739 | 0 |
| 2236 | -3.761 | NaN | 0.195 | -1.638 | 1.261 | -1.574 | -3.686 | 1.576 | -0.310 | -0.138 | -4.495 | 1.817 | 5.029 | 1.437 | -8.109 | -2.803 | -0.187 | 5.801 | -3.025 | 2.019 | -5.083 | 3.033 | 5.197 | 3.117 | -1.580 | 0.259 | -3.535 | -2.270 | -2.474 | 2.470 | 1.162 | 7.621 | 1.695 | -3.956 | 2.708 | 4.657 | 1.619 | -5.537 | 1.247 | -1.163 | 0 |
| 2508 | -1.431 | NaN | 0.660 | -2.876 | 1.150 | -0.786 | -1.560 | 2.899 | -2.347 | -0.218 | -1.131 | 2.931 | 2.053 | 0.375 | -3.123 | 1.321 | -1.053 | 3.188 | -2.288 | -1.314 | -2.461 | 1.292 | 3.694 | 3.003 | -1.523 | 0.904 | -2.650 | -2.502 | 0.678 | 3.295 | 3.915 | 6.279 | 3.324 | -4.048 | 3.119 | 3.336 | 0.604 | -3.782 | -0.157 | 1.503 | 0 |
| 4653 | 5.466 | NaN | 4.541 | -2.917 | 0.400 | 2.799 | 0.029 | -7.334 | 1.123 | 1.695 | 1.165 | -2.778 | 0.571 | -3.078 | -1.388 | -8.513 | -6.208 | 1.401 | 0.769 | -9.145 | -6.873 | 2.065 | -4.812 | 1.897 | 0.338 | 7.160 | 4.653 | -2.619 | -1.107 | -2.284 | 3.652 | -1.536 | 4.596 | -4.104 | 4.296 | 0.153 | -3.727 | 6.563 | 0.706 | -0.462 | 0 |
| 6810 | -2.631 | NaN | 2.330 | 1.090 | 0.604 | -1.139 | -0.690 | -1.359 | 0.356 | -1.189 | -1.703 | 3.141 | 2.523 | -2.171 | -3.983 | -3.457 | 0.497 | 1.160 | 1.968 | 0.019 | -3.499 | 0.381 | -0.338 | 0.911 | -1.197 | 3.694 | -2.561 | -0.729 | -0.450 | 0.165 | -1.960 | -0.950 | 0.210 | 0.449 | 1.046 | 0.537 | 0.763 | 1.729 | 1.886 | -1.702 | 0 |
| 7788 | -4.203 | NaN | 2.954 | 0.584 | 4.104 | -0.639 | -2.811 | -0.112 | -1.363 | -0.800 | -1.392 | 0.420 | 3.812 | -1.782 | -7.549 | -1.170 | -3.184 | 2.585 | -1.856 | -5.779 | -4.962 | -0.045 | 1.937 | 6.762 | -4.828 | 9.171 | -7.403 | -4.276 | 0.950 | 3.959 | 6.185 | 12.522 | 9.502 | -7.153 | 5.669 | 1.250 | -2.159 | -0.954 | -0.002 | -1.547 | 0 |
| 8483 | -4.484 | NaN | 1.201 | -2.042 | 2.779 | -0.802 | -5.404 | -1.225 | 1.486 | -0.974 | -5.913 | -0.329 | 7.565 | 0.805 | -12.687 | -7.009 | -1.561 | 8.508 | -5.537 | 0.200 | -8.388 | 4.009 | 5.066 | 3.765 | -2.405 | 4.073 | -4.742 | -4.100 | -3.459 | 2.146 | 1.662 | 9.467 | 4.281 | -7.588 | 3.267 | 5.232 | 1.279 | -5.371 | 1.984 | -1.643 | 0 |
| 8894 | 3.264 | NaN | 8.447 | -3.253 | -3.418 | -2.996 | -0.669 | -0.161 | -0.667 | 3.134 | -2.112 | 3.735 | 5.746 | 0.330 | -1.831 | -3.277 | -5.365 | -1.125 | 3.783 | 0.579 | -7.446 | 0.403 | -4.710 | -3.815 | 2.681 | 1.785 | 7.026 | -3.364 | -3.217 | -2.715 | 4.555 | -4.243 | -3.123 | 2.522 | 5.284 | 7.291 | -0.868 | -4.315 | 3.124 | -2.393 | 0 |
| 8947 | -3.793 | NaN | 0.720 | 2.306 | 0.935 | -0.984 | 0.505 | -0.441 | -2.767 | 1.735 | -1.988 | 4.212 | -2.798 | -2.083 | 0.342 | -1.369 | 2.095 | 0.307 | 5.488 | -0.388 | 0.089 | 0.326 | 0.122 | 6.040 | -1.381 | 0.375 | -2.734 | 2.510 | -1.072 | -0.054 | -1.293 | 1.528 | -0.497 | 3.790 | 1.131 | 0.618 | -0.111 | 5.709 | 1.542 | -2.481 | 0 |
| 9362 | 2.662 | NaN | 2.980 | 4.431 | -0.238 | 0.672 | 0.380 | -7.647 | 4.435 | -0.746 | -1.169 | -3.067 | 0.025 | -3.767 | -1.931 | -10.298 | 0.341 | -1.307 | 4.457 | -2.175 | -5.360 | 1.257 | -5.030 | 0.454 | 0.703 | 6.003 | 0.909 | 1.180 | -2.527 | -4.018 | -4.607 | -5.494 | -1.105 | 1.225 | 0.976 | -4.794 | -2.269 | 7.671 | 0.825 | -3.929 | 0 |
| 9425 | -2.354 | NaN | 2.054 | 0.812 | 2.540 | -0.925 | -0.208 | -0.563 | -0.140 | -2.147 | -3.838 | 2.682 | -0.660 | -2.519 | -1.708 | -2.675 | 3.630 | 2.293 | -0.160 | -0.368 | -1.414 | 0.225 | 0.243 | 2.928 | -0.190 | 4.111 | -4.003 | -0.160 | -0.929 | -1.678 | -0.042 | -0.621 | -0.897 | -1.181 | -1.237 | 1.237 | 1.228 | 2.074 | 1.224 | 1.472 | 0 |
| 9848 | -1.764 | NaN | 2.845 | -2.753 | -0.812 | -0.101 | -1.382 | -1.105 | -0.054 | 0.160 | 0.640 | 2.035 | 4.863 | -0.351 | -4.249 | -1.557 | -3.843 | 1.644 | -0.471 | -0.326 | -3.334 | -0.352 | -1.690 | -3.143 | -0.703 | 1.791 | 1.293 | -2.779 | 0.840 | 1.251 | 0.264 | -2.159 | 1.860 | -0.337 | 1.509 | 3.408 | 0.923 | -1.503 | 2.515 | -0.794 | 0 |
| 11637 | -2.271 | NaN | 1.710 | 1.158 | -0.355 | -5.449 | -0.786 | 3.936 | -1.576 | 0.801 | -8.512 | 8.426 | 2.662 | 0.696 | -3.692 | -3.227 | 5.014 | 2.677 | 4.117 | 5.919 | -5.061 | 4.175 | 5.949 | 4.687 | 1.123 | -1.937 | -1.736 | 1.307 | -7.059 | -2.439 | -1.546 | 2.651 | -8.429 | 3.511 | 1.500 | 5.552 | 2.589 | -3.453 | 2.324 | -2.760 | 0 |
| 12339 | -1.664 | NaN | -0.712 | -4.347 | 1.392 | -0.094 | -2.163 | -0.381 | 0.031 | -0.659 | -5.653 | 2.888 | 2.208 | 0.552 | -5.221 | -5.363 | 2.142 | 8.083 | -4.127 | 1.704 | -3.908 | 4.500 | 4.886 | 2.087 | 0.979 | -1.480 | -0.362 | -0.818 | -3.844 | -1.256 | -1.122 | 0.307 | -2.691 | -3.112 | -1.596 | 5.821 | 3.462 | -1.737 | 2.291 | 2.241 | 0 |
| 15913 | 0.768 | NaN | 5.296 | 0.043 | -1.174 | -2.249 | 0.956 | -0.090 | -0.242 | -1.061 | -2.449 | 5.086 | 0.434 | -2.633 | 0.849 | -2.631 | 2.178 | -0.845 | 3.864 | 1.723 | -2.994 | -0.466 | -3.444 | -1.775 | 2.113 | 2.187 | 0.926 | -0.192 | -0.633 | -2.589 | -0.803 | -7.720 | -4.519 | 3.182 | 0.453 | 2.175 | 1.262 | 0.893 | 2.027 | 0.633 | 0 |
| 18342 | -0.929 | NaN | 2.376 | -1.237 | 3.229 | -2.100 | -2.190 | 0.589 | 1.956 | -5.008 | -7.388 | 3.314 | 3.774 | -1.836 | -7.099 | -6.071 | 4.892 | 6.479 | -4.841 | 0.968 | -6.694 | 3.470 | 4.668 | 2.432 | 0.399 | 5.752 | -5.572 | -2.882 | -2.986 | -1.455 | 0.333 | 1.613 | -1.821 | -6.665 | -0.455 | 3.055 | 2.935 | -3.791 | 0.863 | 3.336 | 0 |
| 18343 | -2.377 | NaN | -0.009 | -1.472 | 1.295 | 0.725 | -1.123 | -3.190 | 3.251 | -4.862 | -0.685 | 2.360 | 5.432 | -2.508 | -7.250 | -5.571 | 0.679 | 4.391 | -3.424 | -0.273 | -4.233 | 1.505 | 1.570 | -3.372 | -1.288 | 4.813 | -2.778 | -2.350 | 0.684 | 0.351 | -5.729 | -5.093 | 0.439 | -3.167 | -2.713 | -0.593 | 3.229 | 1.316 | 2.283 | 1.152 | 0 |
| 18907 | -0.119 | NaN | 3.658 | -1.232 | 1.947 | -0.119 | 0.652 | -1.490 | -0.034 | -2.557 | -2.094 | 2.939 | -0.489 | -3.372 | -0.236 | -2.676 | 1.934 | 1.647 | -0.603 | -2.326 | -1.779 | -0.466 | -2.086 | 0.333 | 0.671 | 5.423 | -1.576 | -1.345 | 0.404 | -2.333 | 0.960 | -4.670 | -0.594 | -1.651 | -1.405 | 1.531 | 1.079 | 2.833 | 1.451 | 3.233 | 0 |
# Check for missing values in the test data set
test_results = data_test.isnull().sum()
test_results[test_results>0]
V1    5
V2    6
dtype: int64
# Loop through all sensor features and create a histogram and boxplot for each feature.
for feature in data.columns:
    histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None)
Outlier values in the sensor readings likely represent genuine operating conditions rather than errors, so they will not be treated. Some features show slight skewness, but nothing severe enough to warrant transformation.
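The visual impression of skewness can be backed with a quick numeric check. A minimal sketch on a small stand-in frame (in the notebook, call `.skew()` on `data` directly; the `demo` frame below is illustrative only):

```python
import numpy as np
import pandas as pd

# Stand-in frame: one roughly symmetric and one right-skewed feature
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "V1": rng.normal(size=1000),       # roughly symmetric
    "V2": rng.exponential(size=1000),  # right-skewed
})

# Skewness per feature, ranked by magnitude; values beyond about |0.5|
# suggest noticeable skew, beyond about |1| strong skew.
skew = demo.skew().sort_values(key=lambda s: s.abs(), ascending=False)
print(skew)
```

Running the same one-liner on the training frame confirms which of the 40 sensor features deviate from symmetry.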
# Let's barplot the Target feature from the training/validation data set.
labeled_barplot(data, feature="Target", perc=True)
# Let's barplot the Target feature from the test dataset
labeled_barplot(data_test, feature="Target", perc=True)
# Let's create boxplots of important features vs Target
cols = data[['V3','V15','V18','V36','V39']].columns.tolist()
plt.figure(figsize=(10,10))
# Loop through each important feature
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
sns.boxplot(x="Target",y=variable,data=data,palette="PuBu",showfliers=False)
plt.tight_layout()
plt.title(variable)
plt.show()
# Display the numeric fields in a heatmap to determine if there are any correlations between features
columns_of_interest = ['V3', 'V15', 'V18', 'V36', 'V39','Target']
plt.figure(figsize=(12, 7))
sns.heatmap(
data[columns_of_interest].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
# Let's create a pair plot for the features of interest, using Target as hue
columns_of_interest = ['V3', 'V15', 'V18', 'V36', 'V39','Target']
df_selected = data[columns_of_interest]
# Create a pair plot
sns.pairplot(df_selected, hue="Target")
# Separating target variable and other variables
X = data.drop(columns="Target")
X = pd.get_dummies(X)
Y = data["Target"]
#Check the size of the data
print(f"There are {X.shape[0]} rows and {X.shape[1]} features in the data frame.")
There are 20000 rows and 40 features in the data frame.
# Let's now split the Data set into training and validation data
X_train, X_val, y_train, y_val = train_test_split(
X, Y, test_size=0.25, random_state=1, stratify=Y
)
#Check the size of the data
print(f"There are {X_train.shape[0]} rows and {X_train.shape[1]} features in the data frame.")
#Check the size of the data
print(f"There are {X_val.shape[0]} rows and {X_val.shape[1]} features in the data frame.")
There are 15000 rows and 40 features in the data frame.
There are 5000 rows and 40 features in the data frame.
# Let's prepare the test data set now
# Separating target variable and other variables
X_test = data_test.drop(columns="Target")
y_test = data_test["Target"]
# The test set comprises only numerical sensor data, so there is no need to create dummy features
#Check the size of the data
print(f"There are {X_test.shape[0]} rows and {X_test.shape[1]} features in the data frame.")
There are 5000 rows and 40 features in the data frame.
# Creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="median")
# Fit and transform the train data
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
# Transform the validation data
# Using fit here will cause data leakage
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)
# Transform the test data
# Using fit here would cause data leakage
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_train.columns)
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
All 40 columns (V1–V40) now report 0 missing values in each of the train, validation, and test sets.
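The fit-on-train, transform-everywhere-else discipline used above can also be enforced automatically with a scikit-learn `Pipeline`, which refits the imputer inside each cross-validation fold so the validation fold never influences the medians used to fill it. A minimal sketch on toy data (the variable names are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with a few missing entries
X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]] * 10)
y_demo = np.array([0, 1, 0, 1] * 10)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fitted only on each training fold
    ("model", DecisionTreeClassifier(random_state=1)),
])

# cross_val_score now imputes inside each fold, eliminating leakage by construction
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

The explicit fit/transform calls above are equivalent for a single split; the pipeline form simply makes the leakage-free behaviour automatic once cross-validation enters the picture.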
The nature of predictions made by the classification model will translate as follows:
- True positives (TP): failures correctly predicted by the model; these generators are repaired before they break.
- False negatives (FN): real failures the model misses; these generators break and must be replaced, the most expensive outcome.
- False positives (FP): failures predicted where none exist; these generators are inspected unnecessarily, the least expensive outcome.

Which metric to optimize?
Since a replacement (FN) costs far more than an inspection (FP), we want to minimize false negatives, i.e. maximize Recall.
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
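The confusion-matrix helper promised above is not shown in this excerpt; a minimal sketch of such a function (the name `confusion_matrix_sklearn` and the styling are illustrative) might be:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

def confusion_matrix_sklearn(model, predictors, target):
    """Plot the confusion matrix, annotated with counts and percentages."""
    pred = model.predict(predictors)
    cm = confusion_matrix(target, pred)
    # Annotate each cell with its count and share of all predictions
    labels = np.asarray(
        [f"{count}\n({count / cm.sum():.1%})" for count in cm.flatten()]
    ).reshape(cm.shape)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("Actual label")
    plt.xlabel("Predicted label")
    plt.show()
```

Called as `confusion_matrix_sklearn(model, X_val, y_val)`, it makes the FN cell (missed failures) easy to eyeball alongside the Recall numbers.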
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
Let's start by building different models using KFold and cross_val_score and tune the best model using GridSearchCV and RandomizedSearchCV
Stratified K-Fold cross-validation provides dataset indices to split the data into train/validation sets. It splits the dataset into k consecutive folds (without shuffling by default), keeping the distribution of both classes in each fold the same as in the target variable. Each fold is then used once as validation while the remaining k - 1 folds form the training set.
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Dtree-Reg", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging-Reg", BaggingClassifier(random_state=1)))
models.append(("Adaboost-Reg", AdaBoostClassifier(random_state=1)))
models.append(("GBM-Reg", GradientBoostingClassifier(random_state=1)))
models.append(("RandomForest-Reg", RandomForestClassifier(random_state=1)))
models.append(("Xgboost-Reg", XGBClassifier(random_state=1, eval_metric="logloss")))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Dtree-Reg: 0.6982829521679532
Bagging-Reg: 0.7210807301060529
Adaboost-Reg: 0.6309140754635308
GBM-Reg: 0.7066661857008874
RandomForest-Reg: 0.7235192266070268
Xgboost-Reg: 0.8100497799581561

Validation Performance:

Dtree-Reg: 0.7050359712230215
Bagging-Reg: 0.7302158273381295
Adaboost-Reg: 0.6762589928057554
GBM-Reg: 0.7230215827338129
RandomForest-Reg: 0.7266187050359713
Xgboost-Reg: 0.8309352517985612

CPU times: user 3min 16s, sys: 7.71 s, total: 3min 24s
Wall time: 3min 10s
print("Test Data Recall scores fitted on training data set")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_test, model.predict(X_test))
print("{}: {}".format(name, scores))
Test Data Recall scores fitted on training data set
Dtree-Reg: 0.7127659574468085
Bagging-Reg: 0.6595744680851063
Adaboost-Reg: 0.6134751773049646
GBM-Reg: 0.6914893617021277
RandomForest-Reg: 0.7304964539007093
Xgboost-Reg: 0.8049645390070922
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
We can see that XGBoost gives the highest cross-validated recall, followed by Random Forest and Bagging.
We will tune the XGBoost, Bagging, and Random Forest models and see if the performance improves.
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 832
Before Oversampling, counts of label 'No': 14168

After Oversampling, counts of label 'Yes': 14168
After Oversampling, counts of label 'No': 14168

After Oversampling, the shape of train_X: (28336, 40)
After Oversampling, the shape of train_y: (28336,)
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Dtree-Over", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging-Over", BaggingClassifier(random_state=1)))
models.append(("Adaboost-Over", AdaBoostClassifier(random_state=1)))
models.append(("GBM-Over", GradientBoostingClassifier(random_state=1)))
models.append(("RandomForest-Over", RandomForestClassifier(random_state=1)))
models.append(("Xgboost-Over", XGBClassifier(random_state=1, eval_metric="logloss")))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Dtree-Over: 0.9720494245534969
Bagging-Over: 0.9762141471581656
Adaboost-Over: 0.8978689011775473
GBM-Over: 0.9256068151319724
RandomForest-Over: 0.9839075260047615
Xgboost-Over: 0.9891305241357218

Validation Performance:

Dtree-Over: 0.7769784172661871
Bagging-Over: 0.8345323741007195
Adaboost-Over: 0.8561151079136691
GBM-Over: 0.8776978417266187
RandomForest-Over: 0.8489208633093526
Xgboost-Over: 0.8669064748201439

CPU times: user 5min 27s, sys: 10.4 s, total: 5min 38s
Wall time: 5min 21s
print("Test Data Recall scores using oversampled data sets")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_test, model.predict(X_test))
print("{}: {}".format(name, scores))
Test Data Recall scores using oversampled data sets
Dtree-Over: 0.7659574468085106
Bagging-Over: 0.7695035460992907
Adaboost-Over: 0.8191489361702128
GBM-Over: 0.8546099290780141
RandomForest-Over: 0.8333333333333334
Xgboost-Over: 0.8368794326241135
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
print("Before Undersampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("After Undersampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Undersampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Undersampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Undersampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Undersampling, counts of label 'Yes': 832
Before Undersampling, counts of label 'No': 14168

After Undersampling, counts of label 'Yes': 832
After Undersampling, counts of label 'No': 832

After Undersampling, the shape of train_X: (1664, 40)
After Undersampling, the shape of train_y: (1664,)
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Dtree-Under", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging-Under", BaggingClassifier(random_state=1)))
models.append(("Adaboost-Under", AdaBoostClassifier(random_state=1)))
models.append(("GBM-Under", GradientBoostingClassifier(random_state=1)))
models.append(("RandomForest-Under", RandomForestClassifier(random_state=1)))
models.append(("Xgboost-Under", XGBClassifier(random_state=1, eval_metric="logloss")))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Dtree-Under: 0.8617776495202367
Bagging-Under: 0.8641945025611427
Adaboost-Under: 0.8666113556020489
GBM-Under: 0.8990621167303946
RandomForest-Under: 0.9038669648654498
Xgboost-Under: 0.9014717552846114

Validation Performance:

Dtree-Under: 0.841726618705036
Bagging-Under: 0.8705035971223022
Adaboost-Under: 0.8489208633093526
GBM-Under: 0.8884892086330936
RandomForest-Under: 0.8920863309352518
Xgboost-Under: 0.89568345323741

CPU times: user 20.4 s, sys: 5.37 s, total: 25.8 s
Wall time: 16.2 s
print("Test Data Recall scores for undersampled data sets")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_test, model.predict(X_test))
print("{}: {}".format(name, scores))
Test Data Recall scores for undersampled data sets
Dtree-Under: 0.8085106382978723
Bagging-Under: 0.8581560283687943
Adaboost-Under: 0.8546099290780141
GBM-Under: 0.8687943262411347
RandomForest-Under: 0.875886524822695
Xgboost-Under: 0.875886524822695
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Hyperparameter tuning can take a long time to run, so to keep the runtime manageable you can use the following grids where required.
# Gradient Boosting
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}

# Random Forest
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1), "sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}

# XGBoost
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
#Parameter grid to pass in RandomSearchCV
param_grid={
'n_estimators':[150,200,250],
'scale_pos_weight':[5,10],
'learning_rate':[0.1,0.2],
'gamma':[0,3,5],
'subsample':[0.8,0.9]}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=50,
n_jobs = -1,
scoring=scorer,
cv=5,
random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.2, 'gamma': 5} with CV score=0.8582136930957363:
CPU times: user 2.9 s, sys: 1.68 s, total: 4.58 s
Wall time: 1min 3s
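Beyond `best_params_`, the full `cv_results_` table is worth inspecting, since several candidates may score within noise of the winner. A minimal sketch of the pattern on a small toy search (not the notebook's `randomized_cv` object):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small stand-in dataset and search, for illustration only
X_demo, y_demo = make_classification(n_samples=300, random_state=1)
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": [2, 3, 4, 5]},
    n_iter=4, cv=5, scoring="recall", random_state=1,
)
search.fit(X_demo, y_demo)

# Rank all sampled candidates by mean CV recall, along with their spread
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(top.head())
```

A large `std_test_score` on the top candidate is a hint that the apparent best parameters may not hold up on the validation and test sets.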
# Create a tuned XGB classifier for the original data set.
xgb2_reg_tuned = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.8,
scale_pos_weight=10,
n_estimators=200,
learning_rate=0.2,
gamma=5,
)
# Fit using the training data set.
xgb2_reg_tuned.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=5, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.2, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
# Check the model performance using the training data set.
xgb2_reg_tuned_train_perf = model_performance_classification_sklearn(
xgb2_reg_tuned, X_train, y_train
)
xgb2_reg_tuned_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.979 | 1.000 | 0.960 | 0.979 |
# Check the model performance using the validation data set.
xgb2_reg_tuned_val_perf = model_performance_classification_sklearn(
xgb2_reg_tuned, X_val, y_val
)
xgb2_reg_tuned_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.987 | 0.849 | 0.911 | 0.879 |
# Check the model performance using the test data set.
xgb2_reg_tuned_test_perf = model_performance_classification_sklearn(
xgb2_reg_tuned, X_test, y_test
)
xgb2_reg_tuned_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.987 | 0.837 | 0.922 | 0.877 |
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
#Parameter grid to pass in RandomSearchCV
param_grid={
'n_estimators':[150,200,250],
'scale_pos_weight':[5,10],
'learning_rate':[0.1,0.2],
'gamma':[0,3,5],
'subsample':[0.8,0.9]}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=50,
n_jobs = -1,
scoring=scorer,
cv=5,
random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9290599523843879:
CPU times: user 1.71 s, sys: 1.11 s, total: 2.83 s
Wall time: 22.8 s
# Create a tuned XGB classifier for the undersampled data set.
xgb2_under_tuned = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=10,
n_estimators=200,
learning_rate=0.1,
gamma=5,
)
xgb2_under_tuned.fit(X_train_un, y_train_un)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=5, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
# Check the model performance using the undersampled training data set.
xgb2_under_tuned_train_perf = model_performance_classification_sklearn(
xgb2_under_tuned, X_train_un, y_train_un
)
xgb2_under_tuned_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.979 | 1.000 | 0.960 | 0.979 |
# Check the model performance on the validation set.
xgb2_under_tuned_val_perf = model_performance_classification_sklearn(
xgb2_under_tuned, X_val, y_val
)
xgb2_under_tuned_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.832 | 0.921 | 0.239 | 0.379 |
# Check the model performance on the test set.
xgb2_under_tuned_test_perf = model_performance_classification_sklearn(
xgb2_under_tuned, X_test, y_test
)
xgb2_under_tuned_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.834 | 0.890 | 0.239 | 0.376 |
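The `scorer` passed to `RandomizedSearchCV` in the searches above and below is assumed to be a recall-based scorer, since a missed failure (false negative) is the costliest outcome in this problem. A minimal sketch of how such a scorer could be built:

```python
# Sketch: a recall scorer of the kind assumed to be passed as `scorer`.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score

scorer = make_scorer(recall_score)

# Sanity check: a classifier that always predicts the positive class
# catches every true failure, so its recall is 1.0.
X = np.zeros((4, 2))
y = np.array([0, 1, 1, 1])
clf = DummyClassifier(strategy="constant", constant=1).fit(X, y)
print(scorer(clf, X, y))  # 1.0
```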
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.5, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2} with CV score=0.9038236779453142:
CPU times: user 993 ms, sys: 235 ms, total: 1.23 s
Wall time: 19.2 s
# Create a tuned Gradient Boosting classifier using the parameters found for the undersampled data set.
gbm_under_tuned = GradientBoostingClassifier(
random_state=1,
subsample=0.5,
n_estimators=125,
max_features=0.7,
learning_rate=0.2,
)
gbm_under_tuned.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.7,
n_estimators=125, random_state=1, subsample=0.5)
# Check the model performance on the undersampled training set.
gbm_under_tuned_train_perf = model_performance_classification_sklearn(
gbm_under_tuned, X_train_un, y_train_un
)
gbm_under_tuned_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.979 | 1.000 | 0.960 | 0.979 |
# Check the model performance on the validation set.
gbm_under_tuned_val_perf = model_performance_classification_sklearn(
gbm_under_tuned, X_val, y_val
)
gbm_under_tuned_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.832 | 0.921 | 0.239 | 0.379 |
# Check the model performance on the test set.
gbm_under_tuned_test_perf = model_performance_classification_sklearn(
gbm_under_tuned, X_test, y_test
)
gbm_under_tuned_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.834 | 0.890 | 0.239 | 0.376 |
%%time
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.8990116153235697:
CPU times: user 1.27 s, sys: 186 ms, total: 1.46 s
Wall time: 25.9 s
# Create a tuned Random Forest classifier using the parameters found for the undersampled data set.
randomforest_under_tuned = RandomForestClassifier(
random_state=1,
n_estimators=300,
min_samples_leaf=2,
max_samples=.5,
max_features='sqrt',
)
randomforest_under_tuned.fit(X_train_un, y_train_un)
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=300,
random_state=1)
# Check the model performance on the undersampled training set.
randomforest_under_tuned_train_perf = model_performance_classification_sklearn(
randomforest_under_tuned, X_train_un, y_train_un
)
randomforest_under_tuned_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.961 | 0.933 | 0.989 | 0.960 |
# Check the model performance on the validation set.
randomforest_under_tuned_val_perf = model_performance_classification_sklearn(
randomforest_under_tuned, X_val, y_val
)
randomforest_under_tuned_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.938 | 0.885 | 0.468 | 0.612 |
# Check the model performance on the test set.
randomforest_under_tuned_test_perf = model_performance_classification_sklearn(
randomforest_under_tuned, X_test, y_test
)
randomforest_under_tuned_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.944 | 0.879 | 0.500 | 0.638 |
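The gap between the Random Forest's test recall (0.879) and precision (0.500) is easier to reason about through a confusion matrix. A minimal sketch with hypothetical labels (not the actual model output):

```python
# Sketch: reading precision and recall off a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # hypothetical labels
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])  # hypothetical predictions

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)     # missed failures (fn) drive this down
precision = tp / (tp + fp)  # unnecessary inspections (fp) drive this down
print(recall, precision)  # 0.75 0.6
```

In this problem's cost terms, false negatives are unpredicted failures (expensive replacements) while false positives are only extra inspections, which is why recall is weighted so heavily in model selection.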
# training performance comparison
models_train_comp_df = pd.concat(
[
xgb2_reg_tuned_train_perf.T,
xgb2_under_tuned_train_perf.T,
gbm_under_tuned_train_perf.T,
randomforest_under_tuned_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Xgboost tuned with regular data",
"Xgboost tuned with undersampled data",
"Gradient Boost tuned with undersampled data",
"Random Forest tuned with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Xgboost tuned with regular data | Xgboost tuned with undersampled data | Gradient Boost tuned with undersampled data | Random Forest tuned with undersampled data | |
|---|---|---|---|---|
| Accuracy | 0.999 | 0.979 | 0.979 | 0.961 |
| Recall | 1.000 | 1.000 | 1.000 | 0.933 |
| Precision | 0.974 | 0.960 | 0.960 | 0.989 |
| F1 | 0.987 | 0.979 | 0.979 | 0.960 |
# validation performance comparison
models_val_comp_df = pd.concat(
[
xgb2_reg_tuned_val_perf.T,
xgb2_under_tuned_val_perf.T,
gbm_under_tuned_val_perf.T,
randomforest_under_tuned_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Xgboost tuned with regular data",
"Xgboost tuned with undersampled data",
"Gradient Boost tuned with undersampled data",
"Random Forest tuned with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Xgboost tuned with regular data | Xgboost tuned with undersampled data | Gradient Boost tuned with undersampled data | Random Forest tuned with undersampled data | |
|---|---|---|---|---|
| Accuracy | 0.987 | 0.832 | 0.832 | 0.938 |
| Recall | 0.849 | 0.921 | 0.921 | 0.885 |
| Precision | 0.911 | 0.239 | 0.239 | 0.468 |
| F1 | 0.879 | 0.379 | 0.379 | 0.612 |
Now that the candidate models are tuned, let's compare how they perform on the unseen test data.
# test performance comparison
models_test_comp_df = pd.concat(
[
xgb2_reg_tuned_test_perf.T,
xgb2_under_tuned_test_perf.T,
gbm_under_tuned_test_perf.T,
randomforest_under_tuned_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Xgboost tuned with regular data",
"Xgboost tuned with undersampled data",
"Gradient Boost tuned with undersampled data",
"Random Forest tuned with undersampled data",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| Xgboost tuned with regular data | Xgboost tuned with undersampled data | Gradient Boost tuned with undersampled data | Random Forest tuned with undersampled data | |
|---|---|---|---|---|
| Accuracy | 0.987 | 0.834 | 0.834 | 0.944 |
| Recall | 0.837 | 0.890 | 0.890 | 0.879 |
| Precision | 0.922 | 0.239 | 0.239 | 0.500 |
| F1 | 0.877 | 0.376 | 0.376 | 0.638 |
The selected model is the Random Forest tuned with undersampled data.
The Random Forest maintains strong recall across the training, validation, and test sets while offering substantially better precision than the other undersampled models, suggesting it neither overfits nor underfits, even though other models score higher on individual metrics.
feature_names = X_train.columns
importances = randomforest_under_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
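The same importances are easier to report as a sorted table than by reading values off the bar chart. A self-contained sketch with a stand-in model and synthetic columns (V1..V5 here are illustrative, not the actual sensor features):

```python
# Sketch: feature importances as a sorted pandas Series.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = pd.DataFrame(
    rng.normal(size=(200, 5)), columns=[f"V{i}" for i in range(1, 6)]
)
y = (X["V1"] > 0).astype(int)  # the label depends only on V1
model = RandomForestClassifier(random_state=1).fit(X, y)

# Importances sum to 1; sorting surfaces the dominant predictors.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances.index[0])  # V1
```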
# Let's prepare a new data set that doesn't have imputed data, etc.
# Separating target variable and other variables
X1 = data.drop(columns="Target")
Y1 = data["Target"]
# Since we already have a separate test set, we don't need to divide data into train and test
X_test1 = df_test.drop(columns='Target')
y_test1 = df_test['Target']
# Check the size of the data
print(f"There are {X1.shape[0]} rows and {X1.shape[1]} features in the data frame.")
There are 20000 rows and 40 features in the data frame.
# Check the size of the target
print(f"There are {Y1.shape[0]} target values in the data frame.")
There are 20000 target values in the data frame.
# Check the size of the data
print(f"There are {X_test1.shape[0]} rows and {X_test1.shape[1]} features in the data frame.")
There are 5000 rows and 40 features in the data frame.
# Check the size of the test target
print(f"There are {y_test1.shape[0]} target values in the data frame.")
There are 5000 target values in the data frame.
# We need to impute the X1 (basically the training and validation together) data set to fix missing values.
imputer = SimpleImputer(strategy="median")
X1 = imputer.fit_transform(X1)
# Since the Random Forest using undersampling was selected as the final model, we need to also
# undersample X1 and Y1.
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_under1, y_under1 = rus.fit_resample(X1, Y1)
# Create the pipeline that imputes the data and then fits the final Random Forest model.
Pipeline_model = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        (
            "RandomForest_under",
            RandomForestClassifier(
                random_state=1,
                n_estimators=300,
                min_samples_leaf=2,
                max_samples=0.5,
                max_features="sqrt",
            ),
        ),
    ]
)
# Fit the pipeline on the undersampled training data.
# X_under1 and y_under1 pass through the pipeline steps; imputation happens
# inside the pipeline, so the original X_under1 array is not modified.
Pipeline_model.fit(X_under1, y_under1)
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('RandomForest_under',
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2,
n_estimators=300, random_state=1))])
# Let's check the performance on the test set.
Model_test = model_performance_classification_sklearn(Pipeline_model, X_test1, y_test1)
Model_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.945 | 0.872 | 0.507 | 0.641 |
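Before hand-off, the fitted pipeline would typically be persisted so it can score new sensor readings. A minimal sketch using `joblib` (the persistence tool scikit-learn's documentation recommends), with a tiny synthetic data set and hypothetical file name standing in for the real ones:

```python
# Sketch: saving and reloading the final pipeline with joblib.
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("rf", RandomForestClassifier(n_estimators=10, random_state=1)),
    ]
)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [3.0, 1.0]])
y = np.array([0, 1, 0, 1])
pipe.fit(X, y)

# Dump to disk and reload; predictions should be identical.
path = os.path.join(tempfile.mkdtemp(), "renewind_pipeline.joblib")
joblib.dump(pipe, path)
reloaded = joblib.load(path)
print(bool((reloaded.predict(X) == pipe.predict(X)).all()))  # True
```

Because imputation lives inside the pipeline, the saved artifact handles raw sensor data with missing values end to end.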
The Random Forest tuned with undersampled data can be used to help predict component failures before they occur.
Recommend ReneWind focus monitoring on the most important features identified by the feature importance plot, as these are the strongest signals of impending failure.
Recommend ReneWind consider future feature engineering efforts that combine correlated features, such as V15 and V18.